-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
9157 kw search update backfill folders for GitHub #9260
base: main
Are you sure you want to change the base?
Conversation
d26fe4b
to
d931d7b
Compare
await upsertFolderNode({ | ||
dataSourceConfig, | ||
folderId: d.internalId, | ||
parents: [d.internalId, ...d.parents, repoId.toString()], | ||
title: d.dirName, | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are missing the intermediate "github-code-" thing, which does not exist in connectors database, but does exists in d.parents.
I also did not check the issues/discussion structure but i think we also have intermediate virtual folders here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(no structure for issues, it's just a big database entry)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new commit handles the "Code", "Issues" and "Discussions" nodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋 having a look, providing here first elements about PR description:
- for deploy: here you would write that you deploy connectors, then run the backfill (and in general any other steps, such as monitoring X or Y for risky deploys, etc.)
- on risk: since pretty risky, can you share the local tests you made here to ensure this works well? should include checking after various inserts / deletes, & running the backfill
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks directionally good :)
Left a few coms
Also, can you explain how you detected all the places where we needed to upsert / delete folders?
await deleteFolderNode({ | ||
dataSourceConfig, | ||
folderId: repoId, | ||
loggerArgs: logger.bindings(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 points here:
- what led you to pass logger.bindings here? not sure what this does
- in general, this argument is now not necessary, you can remove it from the call and even delete it from
deleteFolderNode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it because it is used in upsertDataSource. Indeed currently not used in deleteFolderNode (although it is a parameter), but thought that if we start using it, at least we don't have to modify this call to pass it. I'm ok with deleting it 👍
dataSourceConfig, | ||
folderId: githubCodeRepository.repoId, | ||
title: githubCodeRepository.repoName, | ||
parents: [githubCodeRepository.repoId], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't forget, parents[0] === folder_id as you changed in connectors/src/connectors/github/lib/hierarchy.ts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case we are uploading a node for the repository itself, which is at root level, that's why it is its own id
As discussed IRL, we wait for https://github.com/dust-tt/tasks/issues/1785 to be done here, in order to avoid having to migrate folders too -- and we update PR accordingly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We create the following data_source_folders:
This is for Code Sync, but we also handle Issues and discussions via the :
|
Thanks ! Very clear 👍
|
Note: handling of the |
Description
Risk
Break Github connection, lose synced Github data
Tests
admin/cli.sh github code-sync
-> Made sure that the state is fine (correct parents, correct folder created/deleted) once the sync is triggered. Test cases include creating/deleting/updating issue, discussion, folder, repo
---> monitor
data_source_folders
,data_source_nodes
, and github tables (github_code_directories
,github_code_files
,github_code_repositories
, ...)For testing the backfill, add connection with main branch (no folder_node), then checkout and trigger sync -> check tables listed above.
Once everything works, do it with the Dust repo locally (because bigger)
Then we're ok to merge
Deploy Plan